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Abstract 

There is a great need for improved statistical sampling in a range of physical, chemical and biological systems. 
Even simulations based on correct algorithms suffer from statistical error, which can be substantial or even 
dominant when slow processes are involved. Further, in key biomolecular applications, such as the determination 
of protein structures from NMR data, non-Boltzmann-distributed ensembles are generated. We therefore have 
developed the "black-box" strategy for re-weighting a set of configurations generated by arbitrary means to produce 
an ensemble distributed according to any target distribution. In contrast to previous algorithmic efforts, the black- 
box approach exploits the configuration-space density observed in a simulation, rather than assuming a desired 
distribution has been generated. Successful implementations of the strategy, which reduce both statistical error 
and bias, are developed for a one-dimensional system, and a 50-atom peptide, for which the correct 250-to-l 
population ratio is recovered from a heavily biased ensemble. 

Ensemble averages over configurations are fundamental to the analysis of finite-temperature systems of physical, 
chemical, and biological interest, as well as to any statistically defined system. Yet it is well appreciated that 
estimates of such averages based on computer simulations can suffer from both systematic and statistical error [1,2]. 
We therefore ask: Given a set of previously generated configurations of uncertain quality, what is the best way to 
estimate ensemble averages? Our proposed answer, the "black-box re- weighting" (BBRW) strategy described below, 
appears promising in its ability to overcome both types of error in some systems. 

Statistical error is a ubiquitous problem of under-appreciated practical importance. Every algorithm known to 
the authors, including sophisticated methods [3-5], relics on repeated visits to a state (a subset of configuration 
space) in order to generate statistical reliability or precision in the population estimate for that state. If we define 
the simulation-and-system-specific correlation time i corr as the time required to visit all important states at least 
once, then statistical precision requires a long total simulation time, t s ; m » t cm . Standard square-root-of-duration 
arguments [1,2] suggest that a simulation retains a fractional imprecision of \A C orrAsim (on a unit scale). Below, we 
show that BBRW dramatically cuts statistical error, avoiding the slow square-root behavior. 

Systematic bias is typical in some systems of great practical importance — such as full-sized proteins — where, to 
date, it is not clear that any atomically detailed simulation has come close to reaching t cor r- Indeed, surprisingly 
lengthy simulations are required to obtain statistical ensembles for small peptides [6]. Nevertheless, biased non- 
Boltzmann distributed sets of atomically detailed protein structures are regularly generated, e.g., for NMR structure 
determination [7] and in the study of protein folding [8] . These sets can be useful for docking [9] , which is employed in 
drug design. The proposed BBRW strategy, in principle, can convert such sets into statistically distributed ensembles. 



1 Background and theory 

The fundamental basis of our re- weighting approach is the recognition that a simulation's output (the configurations 
generated) contains valuable information that is not conventionally used to estimate ensemble averages. This infor- 
mation is the observed configuration-space density. It is critical to note that the observed density never matches the 
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targeted distribution (e.g., proportional to a Boltzmann factor) because of (i) the omnipresent statistical error, and 
also possibly, (ii) errors in the algorithm or software. Our approach can treat both statistical and systematic error 
because it is blind to the source of error. 



1.1 Theory 

Standard re- weighting theory assumes that configurations in a simulated ensemble are already distributed according 
to a known distribution, typically proportional to a Boltzmann factor [10-14]. The novelty of our re- weighting 
method is that we treat configurations as if they were generated by an unknown process, a "black box." Thus, any 
ensemble can be treated since the results of the re-weighting do not depend on the specific process that was used to 
generate the ensemble. 

Our goal is to estimate ensemble averages and relative weights for configurations Xj — denoted by j — in the target 
distribution P(j), regardless of how the configurations were generated. While our approach will be general, we will 
assume for concreteness that the target ensemble is a classical system with potential energy U(x). In this case, the 
targeted probability density is P(j) = Z~ 1 e~ u ( Xj ^ kBT , where Z = J dxe~ u ^ x ^ kBT is the configurational partition 
function at temperature T. Because Z is typically not known, it will prove useful to define the un-normalized 
distribution 

P(j) = e- v W"> T oc P(j) . (1) 

The overall partition function Z will not affect our quest to assign relative configurational weights. 

The sample to be analyzed consists of a set of N configurations {x\, a?2, ■ ■ ■ , xn} which may have been generated 
by an arbitrary simulation protocol. This set of configurations is distributed according to the observed probability 
density P ohs (j), by definition. In general the observed and targeted distributions will differ, P° hs (j) ^ P(j), due 
to statistical and/or systematic error as explained earlier. It will again prove useful to consider an un-normalized 
density denoted by p obs (j) oc P obs (j). 

With the targeted and observed distributions defined, the straightforward logic of standard re- weighting [11] may 
be applied. In an ideal sample, an infinitesimal configuration-space volume dx about j would contain a quantity of 
configurations proportional to the targeted distribution p(j)dx. However, in actuality, p° hs (j)dx is obtained. The 
erroneous sampling can be corrected to cancel the error, by applying a relative weight 

™ hh (j)=p(j)/p ohs ti) (2) 

to each configuration. For the physical systems considered here the target distribution is given exactly by the 
standard Boltzmann factor of Eq. (JT|) . We therefore assume p is "known" in the sense that it can be evaluated for an 
arbitrary configuration. Then, because Eq. Q is exact, the key to practical application of this equation is accurate 
estimation of p ohs . 

Note that the relative weights of Eq. ^ will only be applied to the N configurations already generated — and 
therefore do not report on regions of configuration space not sampled. 

Ensemble averages of any function /, in our approach, are given by a weighted sum over sampled configurations 

N , N 

(/>bb = E wbb 0')/0')/^ w ' bb(j) - (3) 

The average ([3| differs from what we term a "canonical average" of the same configurations, where one assumes 
p obs (j) oc p(j); then w hh (j) is a constant, and averages reduce to the usual un-weighted estimate (/) can = fU) /N 

[2]. Note that the proportionality constants for p(j) and p ohs (j) cancel in ^ and thus do not affect the average. 

To see why it is better to re- weight samples when p ohs can be estimated well, consider the extreme case of a discrete 
system with six equi-probable states s (e.g., a cubic die). Of course, all ensemble properties can be calculated by hand, 
but what if statistical sampling were relied upon? Assume N — 30 samples were generated (either by a biased or 
unbiased procedure), yielding the frequencies (8,4,2,4,7,5), proportional to p ohs (s) for s — (1,...,6), respectively. 
Clearly, the canonical assumption p obs oc p will lead to substantial errors in ensemble averages. However, the 
frequencies of observation will exactly cancel p ohs when the relative weights w hb are used in ^ — summing over all 
30 "configurations" — to yield exact ensemble averages. The extension of this simple idea to continuum systems is 
the subject of this report. 

Several additional points should be made regarding the strategy embodied in ^ and ([3]): (i) It is an analysis 
scheme, treating the same set of configurations as might be analyzed canonically; the approach is not capable of 
predicting populations for regions of configuration space that are not in the observed ensemble, (ii) The observed 
relative probabilities {p° hs (j)} will always differ from those intended (e.g., canonical Boltzmann factors) due to 



2 



Figure 1: One-dimensional potential for a "thought experiment." The energy is constant and equal in both states, 
but the right state is 10 times as wide as the left. 

statistical error — and perhaps significantly, (iii) The relative weights ^ are valid regardless of the specific process 
used to generate configurations, (iv) The approach is particularly suited to estimating conformational free energy 
differences, as described below, (v) As in the single histogram approach [10], the BBRW method retains the ability to 
re-weight configurations into an arbitrary target ensemble p. (vi) The principal practical challenge in implementing 
our strategy is the estimation of p obs . 

1.2 Free energy differences 

To provide a proof of principle for our re- weighting approach we will determine free energy differences (AF), between 
states for various test systems [15-21]. 

A well established canonical technique for determining AF is to use ordinary simulation methods (e.g., molecular 
dynamics) to generate an ensemble which is distributed according to the desired distribution p. Then, the free energy 
difference is defined as 

e-^ kBT = Z B /Z A =N B /N A , (4) 

where Z s = J Se s dx e ~ u ( x )/ k BT j g CO nfigurational partition function for state s with coordinates given by x, and 
N s is the number of configurations observed in state s. 

For our re- weighting approach, we do not assume that configurations are distributed according to p. Instead, 
each configuration must first be assigned a relative weight using Eq. (J2|; then AF is estimated based on summing 
these weights in the respective states: 

e -A F/kB T _ Zb/Za = £ w b b(j) / £ M bb (i) (5) 

j e B ' j e A 

where Y] ■ £ s represents a sum over all relative weights w hh (j) in the state s. 

1.3 One-dimensional thought experiment 

The central idea behind our re-weighting approach is that one can utilize the observed configuration-space density 
to re-weight an arbitrary set of configurations into any desired ensemble. Because this idea is so fundamental to the 
present discussion, and apparently novel, it is useful to illustrate the idea using a one-dimensional potential. 

Consider the equal-energy double-square-well system depicted in Fig. [I] for which we would like to estimate the 
relative populations of the two states. Assume that the barrier is so high that it cannot be crossed in a conventional 
simulation of affordable length, but that we can run independent simulations in each state (a common situation for 
biomolecular systems when multiple structures are known). The states' widths are in the ratio 1:10. 

After generating, say, equi-sized trajectories in each state, a conventional view would hold that no analysis of 
the relative populations is possible since the barrier has not been crossed. However, since the right well is ten times 
wider than the left, the observed configuration-space density in the right well will be, on average, one-tenth that in 
the left. Combined with knowledge of the two ensemble sizes, this is sufficient information for calculating the state 
populations. 

More explicitly, one can estimate the observed density by counting configurations in equi-sized histogram bins, 
leading to p ohs (j) oc nj,(j), where n&(j) is the count in the bin containing configuration j. In this example, p(j) — const 
for all j because the energy is constant over both states. Now, by using Eq. Q in Eq. the sum over relative 
weights leads to exact cancellation of rib for each bin. Thus, this purely entropic case is correctly handled, ultimately 



3 



yielding simply the ratio of the numbers of (equi-sized) bins in each state — i.e., the ratio of state widths. Below, we 
will see that the "easier" energetic effects are also treated correctly in less trivial examples. 

The key point is that the observed density p obs , combined with knowledge of the desired ensemble p, provides 
sufficient information to deduce the state populations, regardless of the sampling bias. Perhaps equally importantly, 
we note that the density can be estimated in different ways besides binning, e.g., using configuration-space distances 
between nearby configurations. 

The reasoning used to correct systematic bias can be applied equally to statistical error. Assume now that we 
possess a single continuous trajectory which has crossed the barrier in Fig. [I] at least once. The population estimates 
will be statistically flawed due to finite sampling. Yet, our re- weighting analysis will still yield correct results since 
the approach does not depend on the source of the bias in the data. 

Below we will show that such density information is both present and often accessible in high-dimensional systems. 
We emphasize that the information is always present in principle and does not depend on prior knowledge. 

2 Results and discussion 

How can the observed density be computed in a way that will be useful, even for high-dimensional systems? We 
will explore two distinct strategies: binning and also the use of configuration-space distances ( "nearest- neighbors 
strategy"). For high-dimensional systems, we will demonstrate that not all coordinates need be considered in the 
analysis. We emphasize that other strategies for estimating density are possible [22,23]. 

The test systems chosen for this study are a one-dimensional double- well potential and a 50-atom dileucine peptide 
(ACE-(leu)2-NME). Our reasoning for choosing these systems are: (i) The system sizes are small enough that it is 
possible to compute an independent AF estimate via counting, i.e., using Eq. Q. (ii) Both are non-trivial: the 
one-dimensional potential has high barrier, and dileucine has large side chains and thus 144 degrees of freedom. 

2.1 Binning strategy 

In the simplest use of bins, the observed density is estimated based on simple counting: omitting normalization, 
p ohs (j) — fi-b(i), where n>b(j) is the number of configurations in the bin which includes configuration j. Indeed we 
have tried this scheme and it works, but not optimally. 

A more effective procedure for weighting individual configurations combines simple counting with the assumption 
of local equilibration: within bins, configurations are assumed distributed according to p itself. This assumption could 
correspond to the case where separate canonical simulations have been performed in different regions of configuration 
space, or to a canonical simulation which was not equilibrated on long time scales but well-sampled locally. The 
local-equilibration estimate for the observed distribution is 

P° hs U) = n b (j)p(j)/p ] , (6) 

which implies w hh (j) = Pj/n^j). The bin-specific normalization is the average relative target probability in the bin, 
namely, pj = X^ebin^W / n b(j) ■ This normalization serves to keep the total observed probability of a bin proportional 
to the number of counts, and independent of the local Boltzmann factors. As explained in the Supplementary 
Material, a consequence of adopting the approximation ([6|, is that pj is implicitly assumed proportional to the local 
partition function for the bin containing configuration j. (Each bin has a different local partition function estimate.) 
Nevertheless, we emphasize that the use of local partition functions is not intrinsic to the black-box approach, but 
only to binning methods for estimating p ohs : see, for instance, the nearest-neighbor strategy described below. 

One- dimensional double-well potential. Here we consider the smooth potential of Fig. |2ji. The relative target 
probability is defined by p[x) = e~ u ( x '' kBT , w i aere JJ(x) is the potential energy sketched for the approximately 6 
ksT barrier height used here. The barrier heights for this potential, as well as basin curvatures and separations, are 
fully adjustable, using, U(x) = - In [cxp ( - 0.5(a;-x Q ) 6 - (x - x a ) 2 /2sl) +exp ( - 0.003(a; - x&) 4 - (x - Xb) 2 /2sl)] , 
where we have used x a — 2.0, Xf, — 8.0, s a = 0.5, and s& = 1.5. We estimated the ratio of partition functions for 
the two wells (right over left) using data obtained from ordinary canonical simulation, i.e., Metropolis Monte Carlo 
(MC) [24]. 

Fig. [2] shows the differences between re-weighting, using a binning strategy, and canonical estimates of the ratio 
of partition functions between two wells (right over left) in (a). Results shown are the averages (b) and standard 
deviations (c) from 10 3 independent computations. The canonical estimates (squares) were computed via Q, and the 
black-box estimates (circles) via Eqs. ([5| and (|6| with a bin size of 0.005. All data were generated using the same MC 
trajectories, each initiated from the left basin of the potential in (a). Specifically, all MC trajectories were generated 
by starting the system at x = 2.0. Each trajectory consisted of 10 6 trial moves generated by adding a uniform random 
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Figure 2: Results for a one-dimensional double- well test system, (a) The potential energy plotted as a function 
of the coordinate, (b) The ratio of partition functions between the left and right wells. All results shown are the 
average of 10 3 independent computations. The canonical estimates (squares) are given by Eq. Q, the black-box 
estimates (circles) by Eq. ([5|. An independent estimate, obtained via integration, is shown by the solid horizontal 
line, (c) The standard deviation from the same data as that for figure (b) is plotted showing the rapid decrease in 
the standard deviation for our method (circles) as compared to canonical (squares), (d) Estimate of the ratio of 
partition functions using our re- weighting approach via Eq. ([5]), based on a heavily biased ensemble of configurations. 
Note that canonical estimation using Eq. (E| is not possible. 



deviate between -1.0 and 1.0 to the current position. An independent estimate, obtained via numerical integration, 
is shown by the solid horizontal line in (b). Note that in (b) the apparent accuracy of canonical estimation at short 
times represents only the average behavior; the statistical uncertainty is so large that any individual estimate would 
be completely unreliable. 

Figure p2p demonstrates that the error in our re-weighting analysis decreases at a rate greater than the typical 
square-root behavior, as seen for the canonical error. Since the same ensembles were used for both the re-weighting 
and canonical estimates, this improvement may lead one to believe we are getting something for nothing. In fact, 
there are costs, but in essence, they have already been paid: (i) The re- weighting approach requires knowledge of the 
desired distribution; in our case (and most other cases of interest) this is given by the Boltzmann factor, (ii) The 
observed density is available information that is not generally used for analysis of simulation data, and requires only 
a small additional computational cost. 

In Fig. [2]i we show that our re-weighting approach can be used to estimate the correct populations between the 
left and right wells, with an example that simply cannot be treated canonically. We generated a biased ensemble, 
with configurations that were not distributed according to p (Boltzmann factor for U shown in (a)). Then Eqs. ^ 
and (|5| were used with a bin size of 0.005 to estimate the ratio of partition functions. The data show the averages 
(circles) and standard deviations (error bars) for 10 3 independent computations. Since our re-weighting approach is 
independent of the method used to generate the ensemble, we are able to obtain the correct ratio, even though the 
ensemble used in the analysis was heavily biased. For completeness, we note that the biased ensemble configurations 
were generated using the same MC protocol as above, but with a square well potential defined as U(x) = between 
x = and x = 15 and infinite elsewhere. 

Dileucine peptide. In this high-dimensional molecular example, we again apply our approach to a biased ensemble, 
where canonical estimation would be meaningless. We use the method to estimate the free energy difference AF 
between conformational states of the 50-atom dileucine peptide based on independent simulations of each state. The 
two states considered were defined by backbone dihedrals angles — "alpha": —140.0 < $1.2 < —80.0 and — 120.0 < 
#1,2 < -60.0; and, "beta": -125.0 < $ 1)2 < -65.0 and 120.0 < #1,2 < 180.0. 

To test the accuracy of the results from our re-weighting approach, we also generated an independent estimate of 
AF. An ordinary, unrestrained, 1.0 /is simulation was performed and analyzed via Eq. Q, yielding Ai^ nc i — 3.3±0.1 
kcal/mol, with statistical error estimated via block averaging [25]. The simulation data were generated using tinker 
4.2 with the OPLS-AA forcefield [26]. A 1.0 fs timestep was utilized, and the simulation was coupled to a 300.0 
K bath using the Andersen thermostat [27]. Solvation was provided by the generalized Born surface area (GBSA) 
implicit model [28]. 

To generate a BBRW estimate based on the locally equilibrated scheme of Eq. ^ , simulations of dileucine were 
performed constrained to either the alpha or beta state. We generated 500,000 configurations in each state based on 
a total of «104 ns. Bins used in ^ were constructed uniformly using only the four backbone dihedrals — out of 144 
degrees of freedom — leading to a four-dimensional histogram. Note that empty bins, which are expected for physical 
and statistical reasons, simply do not contribute to averages computed via Eq. ([3| , since only sampled configurations 
are summed over. 

Figure [3^t shows that our re- weighting approach can be used to correctly predict the free energy difference between 
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Figure 3: (a) The free energy difference between the alpha and beta states of dileucine from binning. Results from 
our re-weighting approach using a heavily biased ensemble are shown. Data were generated using Eqs. ((2J, (|) and 
([6]), using only the backbone dihedrals <& and 'J. The independent estimate is shown by the solid horizontal line with 
an approximate error determined by block averaging [25] . (b) Box-counting analysis of dileucine configurations used 
to determine the range of bin sizes for figure (a). 



the alpha and beta states of dileucine. The solid horizontal line is the independent estimate Ai 7 ^ computed via 
Eq. Q with an approximate error determined by block averaging [25]. The circles are results from our re- weighting 
method using the backbone dihedral angles $ and 'J. These data were computed using Eqs. and Q, and 

plotted as a function of the number of bins used per dihedral angle in the histograms. Even though an equal number 
of alpha and beta configurations were utilized in our analysis, the ratio of ~ 250 is recovered. Note that our strategy, 
as shown here, is an example of an end-point free energy difference method, since no path connecting alpha and beta 
is needed (e.g., Refs. [15-21]). 

Fig. [3^, also demonstrates a certain robustness in the approach: there is a broad range of bin sizes that can be 
used generate the correct free energy difference. 

Two criteria were used to determine the range of bin sizes used for the re- weighting results in Fig. [3Jl: (i) The range 
of bin sizes for accurate AF estimation apparently corresponds to the power-law regime of a plot for estimating the 
box counting dimension [29]. This plot is shown in Fig. |3j) where the linear regions denote the power-law behavior, 
(ii) The extreme bin size results are known exactly. If one bin per dihedral angle is chosen, then the estimate will 
always be poor and AF — (for equi-sized ensembles). At the other extreme, if a very large number of bins per 
dihedral angle is used, then each bin will have a count of 1, and p obs = 1 for every structure. Since no true density 
information is obtained near either extreme, we excluded such estimates. 



2.2 Nearest-neighbors strategy 

In addition to the binning implementation above, we now estimate p ohs of Eq. ([2| by computing distances between 
conformations. The motivation behind pursuing such non-binning approaches is the realization that for biomolccular 
systems, it may prove difficult to populate the necessary high-dimensional histogram bins. The number of bins will 
increase exponentially with the number of coordinates, but the density of sampling in configuration space can always 
be estimated based on "distances" in configuration space. 

In general, to implement the nearest-neighbors strategy [30], the distances between all configurations in the 
ensemble are computed, using any metric which preserves the Cartesian volumes intrinsic to partition functions. By 
definition, the dimensionality of the metric can be as large as the number of degrees of freedom needed to describe 
a conformation. The observed configuration-space (relative) density for structure j is then given by 

oha(j) iVdistC?) (7) 

which implies w hh = p(j) i?hyp(j) d /^Vdist0)- Here, Rhyp(j) is the radius of the hypersphere centered on structure j 
enclosing Ndist(j) structures, and d is the dimensionality of the distance metric used. Typically, as we will demonstrate 
below, d can be chosen to be much smaller than the dimensionality of the system. See Supplementary Material for 
further discussion of the approximations entailed when d is less than the full dimensionality of the system. 

We used Eq. to estimate p obs for a particular structure j using the following implementation: (i) Compute 
the distance (using any appropriate metric) between structure j and all other structures in the ensemble, (ii) Choose 
a value of iVdist- The radius of the hypersphere -Rhyp is given by the distance to the iVdist nearest neighbor. For 
example, if iVdist = 10 is chosen, then -Rhyp is given by the distance to the tenth nearest neighbor. 
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Figure 4: The free energy difference (AF) between the alpha and beta states of dilcucinc from a nearest-neighbors 
strategy. Results from our rc-wcighting approach using a heavily biased ensembles are shown. Data were generated 
using Eqs. ^ and with distances computed via Eq. ([8]). The analysis used only the backbone dihedrals <& 
and 'J. Data are given as a function of the number of distances (neighbors) used in the analysis. The independent 
estimate is shown by the solid horizontal line with an approximate error determined by block averaging [25]. 



Dileucine peptide. We tested the approach embodied in Eq. ^fj by estimating the free energy difference (AF) 
between the alpha and beta conformations of dileucine, using the same restrained (thus biased) ensemble that was 
used above. Following the previous success of using only the torsion angles of dileucine, we chose to use a dihedral 
angle distance between structures j and k defined by 



£(<^-4) 2 > ( g ) 

2 = 1 

where <fi could be any dihedral angle (e.g., $i, ^2, Xi)i an d d is the number of angles used in the computation, i.e., 
the metric dimension. 

Figure [4] shows that our re- weighting approach, using the nearest-neighbors strategy, can be used to predict the 
free energy difference between the alpha and beta states of dileucine. The solid horizontal line is the independent 
estimate AF- m ^ computed via Eq. Q with an approximate error determined by block averaging [25]. The circles 
are results from our re- weighting method using the backbone dihedral angles $ and ^ . These data were computed 
using Eqs. ([2]), |5|, and 0, with distances given by Eq. The results are plotted as a function of the number 
of neighbors used in the analysis procedure, iVdist- Even though an equal number of alpha and beta configurations 
were utilized in our analysis, the ratio of ~ 250 is again recovered. 

The accuracy of the nearest-neighbors strategy (or at least the way we implemented the idea) is lower than that 
of the binning approach; see Fig. [3^. However, the results in Fig. [4] are encouraging, and demonstrate that it is 
possible to estimate p obs using a non-binning procedure. 

2.3 Note on efficiency 

It is worthwhile to briefly discuss the efficiency of the re-weighting approach using Eq. ^ compared with that of 
using Eq. Q. There are, in general, two extreme cases to consider: (i) The barrier between the states of interest 
is very low. This implies that barrier crossing events occur frequently, and thus counting via Eq. Q will likely be 
the most efficient approach, (ii) The barrier between the states is too high to cross during ordinary simulation. In 
this case, counting via Eq. Q is not possible. Importantly however, the re-weighting approach of Eq. ^ can still 
be utilized since all that is required are independent simulations in each state. 

The fairly lengthy simulations required for the BBRW method in the case of dileucine reflects the choice of state 
definitions: within each state, i.e., alpha or beta, the rotameric (side-chain) timescales are substantially longer than 
those of the backbone torsions. 



\ 



2.4 Dimensionality 

With dileucine we were able to obtain accurate results for p ohs , and thus AF, using only 4 degrees of freedom — out 
of a possible 144. The underlying assumption is that the the degrees of freedom not included in the analysis will 
provide the same contribution to each relative weight (see Supplementary Material for details). Thus, at a minimum, 
the coordinates used to define the states of interest must be included in the re-weighting analysis (here, the backbone 
dihedrals <& and ^). 
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A particularly promising strategy for the future is to use principal components analysis, or related methods, to 
suggest an optimal reduced set of coordinates [15,31,32]. These collective coordinates can then be used for binning 
or to determine configuration-space distances for the nearest-neighbor approach. 

3 Conclusions 

The black-box re-weighting (BBRW) strategy computes ensemble averages for any target distribution based on 
estimating the observed configuration-space density actually sampled in a simulation — without employing typical 
assumptions about sampling quality. In principle, the approach can be used to re- weight any ensemble, regardless 
of the process used to generate the configurations. Our data show that using BBRW can dramatically reduce both 
systematic and statistical error for some systems. 

As a proof of principle, we used the approach to estimate free energy differences (AF) for non-overlapping states 
of one-dimensional and molecular systems, based on completely independent ensembles generated in each state. 
Further, when ordinary one-dimensional trajectories exhibiting transitions between states were re-analyzed with 
BBRW, statistical error was substantially reduced. 

Practical application of BBRW hinges on estimation of the observed configuration-space density, p ohs , but the 
theoretical basis does not. We have obtained reasonable results using two different methods for estimating p obs , 
emphasizing that the central idea behind our re-weighting approach is independent of the specific strategy used to 
estimate p obs . Strategies beyond those examined in this report are available [22, 23] and will be explored in future 
work. 

There are two primary limitations of the overall approach. First, improved density estimation schemes will be 
needed for higher dimensional systems. Second, BBRW is an analysis method that re-weights configurations in 
existing ensembles; thus, it cannot predict populations for regions of configuration space not represented in the 
original data. 

Our motivation for the black-box approach stems from the still-outstanding challenge of producing statistical en- 
sembles for full-sized proteins. The black-box approach seems promising in this context because: (i) Non-Boltzmann- 
distributed sets of diverse structures can be generated using a variety of methods, including NMR structure prediction 
software [7], and by the addition of atomic detail to coarse models [8,33]. (ii) Local canonical sampling based on 
ad hoc starting structures is readily possible with existing software, (iii) Our own box-counting studies of proteins 
(unpublished) suggest that folded proteins act as systems with dramatically reduced dimensionality; see also [31,32]. 
We also hope that BBRW will prove useful in rc-wcighting implicitly solvated biomoleculcs into explicit solvent 
ensembles. Finally, the idea could prove of value in the analysis of data from insufficiently converged canonical 
simulation of any type. 
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